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Abstract 

This paper introduces a coordinate descent version of the Vu-Condat aigorithm. By coordinate 
descent, we mean that oniy a subset of the coordinates of the primai and duai iterates is updated at 
each iteration, the other coordinates being maintained to their past value. Our method allows us to solve 
optimization problems with a combination of differentiable functions, constraints as well as non-separable 
and non-differentiable regularizers. 

We show that the sequences generated by our algorithm converge to a saddle point of the problem 
at stake, for a wider range of parameter values than previous methods. In particular, the condition on 
the step-sizes depends on the coordinate-wise Lipschitz constant of the differentiable function’s gradient, 
which is a major feature allowing classical coordinate descent to perform so well when it is applicable. 

We illustrate the performances of the algorithm on a total-variation regularized least squares regression 
problem and on large scale support vector machine problems. 


1 Introduction 

We consider the optimization problem 


inf f{x) + g{x) + h{Mx) (1) 

where X is a Euclidean space, M : X —>■ y is a linear operator onto a second Euclidean space F; functions 
f : X ^ R, g : X ^ {—oo, -boo] and h : y ^ (— 00 , -boo] are assumed proper, closed and convex; the 
function / is moreover assumed differentiable. We assume that X and y are product spaces of the form 
X — Xi X ■■■ X Xn and jV = Fi x ■ ■ • x for some integers n,p. For any a; G df, we use the notation 
X = {x^^\ ... ,x^^'>) to represent the (block of) coordinates of x (similarly for y = ... ,in F). 

Problem Q has numero us applica tions e.g. in machine learning [CBS14] . image processing |CCC~*~10] or 
distributed optimization |BPC+lf) . 

Under the standard qualification condition 0 G Ti{Mdoing — domh) (where dom and ri respectively stand 
for domain and relative interior), a point a; G bf is a minimizer of 0 if and only if there exists y € y such 
that (a;, y) is a saddle point of the Lagrangian function 

L{x, y) = fix) + g{x) + {y, Mx) - h*iy) 

where (.,.) is the inner product and h* : y ^ snp^^y{y^z) — h(z) is the Fenchel-Legendre transform 
of h. There is a rich literature on primal-dual algorithms searching for a saddle point of L (see |TDC14| and 
references therein). In the special case where f = 0, the alternating direction method of multipliers (ADMM) 
proposed by Glowinsky and Marroco |GM75| . Gabay and Mercier [GM76| and the algorithm of Chambolle 
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and Pock [CPTT1 are amongst the most celebrated ones. In order to encompass the presence of a nonzero 
smooth function /, Combettes and Pesquet proposed a primal-dual splitting algorithm which, in the case of 
Problem Q, involves two calls of the gradient of / at each iteration [CP12) . Hence, function / is handled 
explicitely in the sense that the algorithm does not involve, for instance, the call of a proximity operator 
associated with /. Based on an elegant idea also used in |HY12) . Vu |Vul3| and Condat |Conl3| separately 
proposed a primal-dual algorithm allowing as well to handle / explicitely, and requiring one evaluation of 
the gradient of / at each iteration. A convergence rate analysis is provided in |CP14a] (see also |TDC14) b 
A related splitting method has been recently introduced by [DY15] . 

This paper introduces a coordinate descent (CD) version of the Vu-Condat algorithm. By coordinate 
descent, we mean that only a subset of the coordinates of the primal and dual iterates is updated at 
each iteration, the other coordinates being maintained to their past value. Coordinate descent was his¬ 
torically used in the context of coordinate-wise minimization of a unique function in a Gauss-Seidel sense 
[War63| |BT89) [TMOlj . Tseng et al. |LT02| |TY) [TY10| and Nesterov |Nesl2] developped CD versions of 
the gradient descent. In [Nesl2| as well as in this paper, the updated coordinates are randomly chosen at 
each iteration. The algorithm of [NesI2) has at least two interesting features. Not only it is often easier 
to evaluate a single coordinate of the gradient vector rather than the whole vector, but the conditions un¬ 
der which the CD version of the algorithm is provably convergent are generally weaker than in the case of 
standard gradient descent. The key point is that the step size used in the algorithm when updating a given 
coordinate i can be chosen to be inversely proportional to the coordinate-wise Lipschitz constant of V/ along 
its ith coordinate, rather than the global Lipschitz constant of V/ (as would be the case in a standard gradi¬ 
ent descent). Hence, the introduction of coordinate descent allows to use larger step sizes which potentially 
results in a more attractive performance. The random CD gradient descent of [NesI2) was later generalized 
by Richtarik and Takac |IITI4] to the minimization of a sum of two convex functions f g (that is, h = 0 
in problem 0 ). The algorithm of [RT14| is analyzed under the additional assumption that function g is 
separable in the sense that for each x G X, g{x) = some functions gi : Xi ^ (— c»,-|-oo]. 

Accelerated and parallel versions of the algorithm have been later developed by |RT12j |RT15| |FRI3| always 
assuming the separability of g. 

In the literature, several papers seek to apply the principle of coordinate descent to primal-dual algo¬ 
rithms. In the case where f = 0, h is separable and smooth and g is strongly convex, Zhang and Xiao [ZX14] 
introduce a stochastic CD primal-dual algorithm and analyze its convergence rate (see also [SuzI4| for re¬ 
lated works). In 2013, lutzeler et al. |IBCH13] proved that random coordinate descent can be successfully 
applied to fixed point iterations of firmly non-expansive (FNE) operators. Due to |Gab83| . the ADMM can 
be written as a fixed point algorithm of a FNE operator, which led the authors of |IBCH13] to propose a 
coordinate descent version of ADMM with application to distributed optimization. The key idea behind the 
convergence proof of |IBCH13] is to establish the so-called stochastic Fejer monotonicity of the sequence of 
iterates as noted by |CP14b| . In a more general setting than |IBCH13] . Combettes et al. in |CP14b| and 
Bianchi et al. |BHI14| extend the proof to the so-called a-averaged operators, which include FNE operators 
as a special case. This generalization allows to apply the coordinate descent principle to a broader class 
of primal-dual algorithms which is no longer restricted to the ADMM or the Douglas Rachford algorithm. 
For instance, Forward-Backward splitting is considered in |CPI4bj and the Vu-Condat algorithm is consid¬ 
ered in |BHII4[ IPR I4) . However, the approach of |IBCH13] , |CPI4b) , |BHII4) which is based on stochastic 
Fejer monotonicity, has a major drawback: the convergence conditions are identical to the ones of the brute 
method, the one without coordinate descent. These conditions involve the global Lipschitz constant of the 
gradient V/ instead than its coordinate-wise Lipschitz constants. In practice, it means that the application 
of coordinate descent to primal-dual algorithm as suggested by |CPI4bj and |BHI14j is restricted to the use 
of potentially small step sizes. One of the major benefits of coordinate descent is lost. 

In this paper, we introduce a CD primal-dual algorithm with a broader range of admissible step sizes. 
The algorithm is introduced in Section At each iteration k, an index i is randomly chosen w.r.t. the 
uniform distribution in {I,..., n} where n is, as we recall, the number of primal coordinates. The coordinate 
(i) 

x). of the current primal iterate Xk is updated, as well as a set of associated dual iterates. Under some 
assumptions involving the coordinate-wise Lipschitz constants of V/, the primal-dual iterates converges to 
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a saddle point of the Lagrangian. As a remarkable feature, our CD algorithm makes no assumption of 
separability of the functions /, g or h. In the special case where h = 0 and g is separable, the algorithm 
reduces to the CD proximal gradient algorithm of |RT14| . The convergence proof is provided in Section]^ It 
is worth noting that, under the stated assumption on the step-size, the stochastic Fejer monotonicity of the 
sequence of iterates, which is the key idea in [IBCHlSj . |CPI4b| . |BHII4) . does not hold (a counter-example 
is provided). Our proof relies on the introduction of an adequate Lyapunov function. In Section the 
proposed algorithm is instanciated to the case of total-variation regularization and support vector machines. 
Numerical results performed on real IRM and text data establish the attractive behavior of the proposed 
algorithm and emphasize the importance of using primal-dual CD with large step sizes. 


2 Coordinate Descent Primal-Dual Algorithm 

2.1 Notation 

We note M = {Mj^i : z £ {1,..., n}, j £ {1, ■ ■ ■ ,p}) where Mj^i : Xi yj are the block components of M. 
For each j £ {1,... ,p}, we introduce the set 

I{j) = G {1,.. .,n} : M,,, ^ o|. 

Otherwise stated, the jth component of vector Mx only depends on x through the coordinates such that 
i £ I{j)- We denote by 

rrij = card(/(j)) 

the number of such coordinates. Without loss of generality, we assume that ruj ^ 0 for all j. For all 
z £ { 1 ,..., n}, we define 

J(z) = £ { 1 ,... ,p} : Mj,, ^ o|. 

Note that for every pair (z,j), the statements z £ I{j) and j £ J(z) are equivalent. 

Recall that y = yi x • • • x y^. For every j £ {!,... ,p}, we use the notation yj = An arbitrary 

element u in yj will be represented by it = {u{i) : i £ I{j)). We define x • • • x yp. An arbitrary 

element y in will be represented as y = {y^^\ ■ ■. These notations are recalled in Tablebelow. 


Space 

Element 

X = Xix ■■■ X X„ 

X = (xb) : z £ {1,..., rz}) 

y = yix---xyp 

y = ■■j G {!,...,p}) 


u = (u(i) : i G /(j)) 

>; = X • • • X 

y = - j G {!,...,p}) 

where ybO = {y^^\i) : z G d(j)) Vj 


Table 1: Standing notations. 

If (. is an integer, 7 = ( 71 ,... , 7 ^) is a collection of positive real numbers and A = Ai x • • • x is 

an arbitrary product of Euclidean spaces, we introduce the weighted norm || . on A given by ||zz||^ = 

Si=i every u = ..., where || . lU- stand for the norm on Ai- If F : A —>■ (— 00 , -foo] 

denotes a convex proper lower-semicontinuous function, we introduce the proximity operator prox.^^^ : A —>■ 
A defined for any zz £ A by 

prox p{u) = argmin F{w) -b vHzc - w|P_i 
" wGA Z ' 

where we use the notation 7 “^ = ( 7 )"^,..., 7 ^^). We denote by prox^*^^. : A ^ Ai the zth coordinate 
mapping of prox.^^^ that is, pTOx^ p{u) = (prox^^^^(zz),..., prox^^^^(zz)) for any u G A. The notation ^^( 7 ) 
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(or simply I? (7) when no ambiguity occurs) stands for the diagonal operator on A ^ A given by = 

..., for every u = ..., 

Finally, the adjoint of a linear operator B is denoted B*. The spectral radius of a square matrix A is 
denoted by p{A). 

2.2 Main Algorithm 

Consider Problem Q. Let a = (cri,..., (Tp) and r = (ri,...,r„) be two tuples of positive real numbers. 
Consider an independent and identically distributed sequence (ik : k G N*) with uniform distribution on 
The proposed primal-dual CD algorithm consists in updating four sequences Xk G X, Wk G X, 
Zk Gy and y^. G y. It is provided in Algorithm]^ below. 


Algorithm 1 Coordinate-descent primal-dual algorithm 
Initialization: Choose xg G X, G y. 

For alH e {1,..., n}, set = EjeJ(i) yo’^(*)- 

For all j G {1,... ,p}, set Eie/O) 

Iteration fc: Define: 


Vk+i = prox^ (zfc -K D{a)Mxk) 

Xk+i = pTOx^^g(^Xk - D{t) [Vf{xk) + - Wk) ) . 

For i = ik+i and for each j G J(ife+i), update: 


ph) _ 

•^k + l ~ -^k+l 

Vkliii) = Vkh 

4+1 = 4*^+ ^l^iyklS)-yk\^) 
4ii = 4'^ + ^(yi+iW - yk\i)) ■ 

iflj 

^,1 . , (i) (i) (i) (i) (j) (j) 1 (j) /-\ 

Otherwise, set = xi \ wi^ — ^ ^ ^ ^ 




and y^^l,i^) = y^^^. 


Remark 1. In Algorithm it is worth noting that quantities {xk+i^Vk+i) do not need to be explicitely 
calculated. At iteration fc, only the coordinates 

4 + 1 '^ and Vj G J(h.+i) 

are needed to perform the update. When g is separable, it can be easily checked that other coordinates do 
not need to be computed. From a computational point of view, it is often the case that the evaluation of 
the above coordinates is less demanding than the computation of the whole vectors Xk+i, Vk+i- Practical 
examples are provided in Section]^ 

For every i G {1,..., n}, we denote hy Ui : Xi ^ X the linear operator given by Ui{u) = (0, • • • , 0, u, 0, • • • , 0) 
he., all coordinates of Ui(u) are zero except the ith coordinate which coincides with u. Our convergence 
result holds under the following assumptions. 

Assumption 1. a) The functions f, g, h are closed proper and convex. 

^The results of this paper easily extend to the selection of several primal coordinates at each iteration with a uniform 
samplings of the coordinates, using the techniques introduced in IRT15I . 
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b) The function f is differentiable on X. 

c) For every i G {1,..., n}, there exists /3i > 0 such that for any x G X, any u G Xi, 

f{x + Uiu) < f{x) + {Vf{x),Uiu) + . 

d) The random sequence (ik)kGH* is independent with uniform distribution on {1,... ,n}. 

e ) For every i G 

1 

n < -^ . 

Pi+p 

We denote by S the set of saddle points of the Lagrangian function L. Otherwise stated, a couple 
[x^.y*) G X X y lies in S if and only if it satisfies the following inclusions 


0 e V/(x*) + 5g(a;*) + M*j/* (2) 

0 G —Mx* + dh*{y^). (3) 


We shall also refer to elements of S as primal-dual solutions. 

Theorem 1. Let Assumption^ hold true and suppose that 5 7 ^ 0. Let {xk,yk) be a sequence generated by 
Algorithm^ Almost surely, there exists (x*,j/*) G S such that 

lim Xk = X* 

k—foo 

lim yi^\i) = yi^'’ (Vj G {1,... ,p}, Vi G I{j)). 

k—>-oo 


2.3 Special cases 

2.3.1 The case mi = • • • = nip = 1 

We consider the special case mi = • •• = mp = 1. Otherwise stated, the linear operator M has a single 
nonzero component Mj^i per row j G {1,. .. ,p}. This case happens for instance in the context of distributed 
optimization |BHI14| . 

For each j G {1,.. . ,p}, the vector y^jf'^ is reduced to a single value y^\i) G yj where i is the unique 
index such that Mj^i 7 ^ 0. We simply denote this value by y^\ Algorithm 111 simplihes to Algorithm!^ below. 


2.3.2 The case h — 0 


Instanciating Algorithm in the special case h = 0, it boils down 
algorithm: 

(i) _[ ■proxiig{xk - D(r)V fix,)) 

“ 1 X 


to the following CD forward-backward 


if * = ik+i 
otherwise. 


(4) 


As a consequence, Algorithm allows to recover the CD proximal gradient algorithm of |RT14j with the 
notable difference that we do not assume the separability of g. On the other hand. Assumption [Ij[e|) becomes 
Ti < l/f3i whereas in the separable case, |RT14j assumes = l/A. This remark leads us to conjecture 
that, even though Assumption [Tp| generally allows for the use of larger step sizes as the ones suggested by 
the approach of |CP14b) |BHI14| . one might be able to use even larger step sizes than the ones allowed by 
Theorem ^ 

Note that a similar CD forward-backward algorithm can be found in [CP14b| with no need to require 
the separability of g. However, the algorithm of |CP14b) assumes that the step size Ti (there assumed to 
be independent of i) is less than 2//3 where /3 is the global Lipschitz constant of V/. As discussed in the 
introduction, an attractive feature of our algorithm is the fact that our convergence condition < 1/Pi only 
involves the coordinate-wise Lipschitz constant of V/. 
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Algorithm 2 Coordinate-descent primal-dual algorithm - Case mi = ■ ■ • = mp = 1. 
Initialization: Choose xq G X, yo G y. 

Iteration k: Define: 


Vk+i = {Vk + D{a)Mxk) 

Xk+i = g(xk - D{t){V f{xk) + M*{2yk+i - Vk))'^ ■ 

For i = ik+i and for each j G J{ik+i), update: 


(i) —(i) 


O') _^0) 

yfc+1 ~ yfc+1 


Otherwise, set = x\^\yl^l.^ = y^^\ 


2.4 Failure of stochastic Fejer monotonicity 

As discussed in the introduction, an existing approach to prove convergence of CD algorithm in a general 
setting (that is, not restricted to h = 0 and separable g) is to establish the stochastic Fejer monotonicity 
of the iterates. The idea was used in |IBCH13] and extended by |CP14b| and |BHI14j to a more general 
setting. Unfortunately, this approach implies to select a “small” step size as noticed in the previous section. 
The use of small step size is unfortunate in practice, as it may significantly affect the convergence rate. 

It is natural to ask whether the existing convergence proof based on stochastic Fejer monotonicity can 
be extended to the use of larger step sizes. The answer is negative, as shown by the following example. 
Consider the toy problem 

min -I- x*'^^ -|- x*'^^ — 1)^ 

a:GR3 2 

that is we take /(x) = -l-x^^^ -l-x^^^ — 1)^ and g = h = M = 0. One of the minimizers is x* = 

The global Lispchitz constant of V/ is equal to 3 and the coordinate-wise Lipschitz constants are equal to 
1. The CD proximal gradient algorithm Q writes 

Ji) ^ f + ^k'’ + ^k'* - 1) if * = ^k+1 

I x[,*i otherwise 

where we used ti = r 2 = T 3 = r for simplicity. By Theorem]^ x^ converges almost surely to x* whenever 
r < 1. Setting Xq = 0, one has ||xo — x*|p = |. It is immediately seen that E||xi — x*|p = (r — + i + ^ 

where E represents the expectation. In particular, E||xi — x*|p > ||xo — x*|p as soon as r > 2/3. Therefore, 
the sequence E||xfe —x* p is not decreasing. This example shows that the proof techniques based on monotone 
operators and Fejer monotonicity are not directly applicable in the case of long stepsizes. 

3 Proof of Theorem [l] 

3.1 Preliminary Lemma 

For every (x,y) S A x y, we define V{x,y) = ^\\x\\l.i -t {y,Mx) + \\\y\\l-i ■ 

Lemma 1. Let Assumption [7]Q^ hold true. Let {x,y) G X x y and (x*,?/*) G S. Define 

y = prox^ kx {y + D{a)Mx) 

X = prox^_g(^x - D(t)(V/(x) -k M*{2y-y))'j 
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and set z = (x,y), z* = (a;*,?/*), z = {x,y). Then, 

(V/(a;*) - V/(x), x^-x) +V (z, z) <V{z,z^) -V(z, z*). 

Proof. The inclusions ^ also read 

Vm G -T, g{u) > g{x^) + (—V/(x*) — M*y^, u — xf) 

\/v e y, h{v) > h{y^) + (Mx*, w — y*) . 

Setting u = x and w = y in the above inequalities, we obtain 

9 {x) > g{xy) + (V/(a;,) + M*y,„x^ - x) (5) 

h*{y) > h*{y^) + {Mxt,y - y^). (6) 

By definition of the proximal operator, 

y = argmin/i*(w) - {v,Mx) + ^||u-y||^-i (7) 

v^y z 

X = argming{u) + {u,Wf{x) + M*{2y - y)) + - a;||^-i . (8) 

uGPc Z 

Consider Equality Q above. It classically implies |Tse08] that for any v G y, 

h*iy) - {y,Mx) + ^\\y-y\\l-i < h*{v) - {v,Mx) + ^\\v-y\\l-i - ^Ily-u||^_i . 

Setting u = y*, we obtain 

h*{y) < h*{y^) + {y-y^,Mx) + ^\\y* - y\\l-i - ^\\y - y*\\l-^ - ^\\y - vWl-^ 
and using ([^, we finally have 

(M(a;* -x),y-y^) < ^|ly* -y||^-i - ^WV - y*\\l-^ “ ^WV - y\\l-^ (9) 

Similarly, Equality ([^ implies that for any u G X, 

g{x) + {x,\Jf{x) + XI*{2y-y))x]^\\x-x\\l-i < g{u) + {u,\Jf{x)xM*{2y-y)) + ]^\\u-x\\l-^-]^\\x-u\\l-i. 
We set u = x*. This yields 

g{x) < g{x^) + (x* -x,\/f{x)+M*{2y-y)) + ^\\x,, - x\\l-i - ^\\x - x4l-i - ^||x-x||^-i . 

Using moreover Inequality ([^, we obtain 

(V/(x*)+ M*y*,x*-x) < (x*-x,V/(x)+ M*(2y-y)) + ^||x*-x||^_i - ^||x-x*||^_i - ^||x-x||^_i 
hence, rearranging the terms, 

(V/(x*) - V/(x),x* -x) - i||x* -x||^_i + ^||x-x*||^_i + ^||x-x||^_i < (2y-y-y*,M(x* -x)) . 
Summing the above inequality with Q, 

(V/(x,) - V/(x),x» -x) + ^||x-x||^_i + (y-y,M(x-x)) + ^\\y - y\\l-i 

< ^\\x-x4l-i + {y-y^,M{x-x4) + ^\\y-y4l-i-^\\x-x4l-i-{y-y,„M{x-x4)-^\\y-y4l-i . 
This completes the proof of the lemma thanks to the definition of V. □ 
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3.2 Study of Algorithmic 

We first prove Theorem in the special case mi = • • • = nip = 1. In that case, Algorithm boils down to 
Algorithm]^ We recall that in this case, the vector is reduced to a single value G yj where i is 

the unique index such that Mj^i ^ 0. We simply denote this value by . 

We denote by the filtration generated by the random variable (r.v.) h,-- - ,4- We denote by 

Efc(.) = E(. \Fk) the conditional expectation w.r.t. Fk- 

Lemma 2. Let Assumptions hold true. Suppose mi = • • • = m^ = 1. Consider Algorithm^ and 

let 7 i,... , 7 ji be arbitrary positive coefficients. For every k > 1 and every Fk-measurable pair of random 
variables (A, Y) on X x y, 

E/i;{^/c+l) - “t“ (1 )X/g 

n n 

Efe(||x,+i - A||^) = i||x,+i - Al!^ + (1 - l)||xfc - A||2 
Efe(||yfc+i - All^O = ^||y,+i - + (i - l)||y, - y||2_, 

Efe((yfe+i - r, M(xfe+i - A))) = ^{y„^i - Y, M(xfe+i - A)) + (1 - ^){yk - Y, M{xk - A)). 

Proof. The first equality is immediate. 

Consider the second one. Efe(||xfe+i — A||^_i) = A~^Efc(||x^*|^ — A^^^p) which coincides with 

Yn=i A~^(nll^i+i “ + (1 - - A(*)|P) and the second equality is proved. 

Similarly for the third equality, Eki\\yk+i - ^ll^-O = f^f^^kiWVkli - Y^^'>\\'^) and for every j, 

Mhi'h - = \\yi% - P(j G J(*fc+i)) + ||y« - P(j i j[ik+i)). 

P(j G J(4+i)) = P(*fe +1 e /(j)) = card(/(j))/n = 1/?^, implies that f) = i||yW^_y(i)||2 + 

(1 — ^)\\yk^ — and the third equality is proved. 

Consider the fourth equality. Note that 

n 

{yk+i-Y,M{xk+i-X))=Y, ^ (y«^_yO-),M,-,(x«,-A«)). 

i=l iGJ(i) 


For any pair (i,j) such that j G J{i), the conditional expectation of each term in the sum is equal to 
7 (yi+i - 7 ( 4+1 - X^^)) + (1 - - A(*))) which in turn implies the fourth 

equality in the Lemma. □ 

Assume that > Pi for each i G {1,..., n}. Define for every (x, y) G X x y, 


V{x,y) = ^\\x 


\U-i 


{y,Mx) + - I 


Lemma 3. Let Assumptions hold true. Suppose mi = • • • = mp = 1 and assume that Ti ^ > 

for each i G {1,... ,n}. Consider Algorithm^ and define for every A: G N, 


Sk,* = fixk) - fix*) - (V/(x*),Xfc - x*) . 


( 10 ) 


Then the following inequality holds: 

Efc [Sk+i,* + F(zfc+i,z*)] < (1 - -)5'fe * YV{zk,z*) - -V{zk+i,Zk) 

n n 


( 11 ) 


where Zk+i = {xk+i.yk+i)- 





Proof. Choosing Z = {X, Y) as in Lemmaj^ denoting Zk = {xk, yk) and Zk = {xk,yk)j we obtain immediately 
Ek(y{zk+i,Z)) = ^V{zk+i,Z) + (1 - ^)V{zk,Z) or equivalently, 

V{zk+i,Z) = nEkiVizk+i, Z)) - (n - l)V{zk, Z). (12) 

Let z* = (x*,?/*) e S. By Lemma[^ 

(V/(a;*) - Xf{xk),Xt - Xk+i) + V{zk+i,Zk) < V{zk,z^) - V{zk+i,z*). 

Identifying Z in © to z* and Zk successively, we obtain 

(V/(a:*) - Xf{xk),x^ - Xk+i) + nEfc(C(zfe+i,Zfe)) < nV{zk,z^) - nEfc(C(zfe+i, z*)). 

Dividing both sides of the above inequality by n and using that Xk+i = nEk{xk+i) — {n — l)xk, we obtain 

(V/(a;,) - Vf{xk),x^ - Ek{xk+i) + (1 - -)(a:fe - x*)) +Ek{V{zk+i, Zk)) < V{zk,z^.) - Ek{V{zk+i, z^)) . 

n 

Rearranging the terms. 


^/c+1 Xk) Y Y{^Zk^l-, Z.iff\ 

< --(V/(xfe) - V/(a:*),Xfe - xf) + R(zfc,z*) - Ek{V{zk+i, Zk)) ■ 
n 

We now use Assumption [I][^ , knowing that Xk+i only differs from Xk along coordinate ik+i 
fiXk + l) < /(Xfc) + {Vf{Xk),Xk + l - Xk) + ^^^^\\xk+l - Xfcll^ 


which implies that (V/(xfe),Xfc+i - Xk) > fixk+i) - f{xk) -^||xfc+i - Xfc|p. Thus, 

r 

Efe 


/(xfe+i) - f{xk) - ^^^^\\xk+i - Xfclp - (V/(x*),Xfe+i - Xk) + R(zfe+i,z*) 

1 


<-(V/(xfe) - V/(x*),Xfe - X*) + R(zfc,z*) - Efc(R(zfe+i,Zfc)). 


Introducing the quantity Sk,* as in (10), the inequality simplifies to 


El, 


Sk+i,* + V{zk+i, z*) — 


^^k+l II ||2 

-—\\Xk+l - Xk\\ 


< /(xfc) - /(x*) - (1 - -)(V/(x*),Xfe - X*) - -(V/(xfc),Xfe - X*) + R(zfc,z*) - Ek{V{zk+i,Zk)) ■ 
n n 

An estimate of the right-hand side is obtained upon noticing that (V/(xfe),Xfc — x*) > f(xk) — f{xt 
Therefore, 


E/c 


Sk+i,* -I- R(zfe+i,z*) - ^^^^||xfc+i - Xk 


< (1 - -)Sk,* + V{zk,z^) - Ek{V{zk+i,Zk)) ■ 

n 


Upon noting that ||xfc+i - Xk\\l-i - Pi^+^Xk+i - XkW^ = ||xfc+i - Xfc||^-i_^, we obtain 
Efe [Sk+i,* -I- U(zfc+i,z*)] < (1 - -b U(zfc,z*) - Ek{V{zk+i,Zk)) ■ 

Using Lemmaj^ it is immediate that Ek{V(zk+i, Zk)) = ^U(zfc-i-i, and the proof is complete. □ 


9 














Recall that we denote by p{A) the spectral radius of a matrix A. 

= rup = 1 and assume that the following condition holds for every 

1 


Lemma 4. Suppose that mi = 
i G 


Ti < 


(13) 


A + P (EjGJ(i) 

Then is a norm on X xy. 

Note that under the assumptions of Lemma is also, a fortiori, a norm. 

Proof. Let 7 “^ = — /3. Denote by D{a) the diagonal matrix on 3^ —>• 3^ defined by D{a){y) = 

{aiy^^\ ... ,apy^P^) for every y = ... ,y^P'>). We define ^( 7 ) similarly on X ^ X. By |H,T121 The¬ 

orem 7.7.6], a sufficient (and necessary) condition for F to be a squared norm is that D( 7 “^) M*D{a)M 
(where notation A y B means that A — B is a, positive definite matrix). Defining R = 

(that is, Rj^i = for every j,i), the condition reads equivalently p{R*R) < 1. As the set I{j) is 

reduced to a unique element for all j, the matrix R*R is (block) diagonal. Precisely, for any 1 < i,t < n, the 
(i, t!)-component {R*R)i i is zero whenever i ^ and is otherwise equal to {R*R)i^i = 7 ^ X)jG Ji^) ■ 

The condition p{R*R) < 1 yields jip ^ ^ {1; ■ ■ ■ which is in turns 

equivalent to condition (13). □ 


Proof of Theorem 1 in the case mi = ... = mp = 1. Let z* be an arbitrary point in S. Whenever condi¬ 
tion (13) is met, the r.v. D(zj,,z*) and V{zk+i, Zk) are non-negative. The r.v. is non-negative as well 
by convexity of /. We review two important consequences of Lemma 

• Define Uk = Sk,* -I- V{zk, z»). A first consequence of Lemma|^is that for all k, 

Efc(Dfc+i) < Uk - Sk,* ■ 

n 


Recalling that Uk and Sk are non-negative r.v., the Robbins-Siegmund Lemma |RS71j implies that almost 
surely, limfe_>oo Uk exists and J2k ^k,* < 00 . In particular, Sk,* converges almost surely to zero. By definition 
of Uk, this implies that limfe_,.oo V{zk,z*) exists almost surely. Following the argument of [Berlll Prop. 9] 
(see also [IBCH13] . |CP14bl Prop. 2.3]), this implies that there exists an event A of probability one such 
that for every a; G A and every z G S, limfc_>oo V^I'^[zk{pS) — z) exists. 

• A second consequence of Lemmais that, by taking the expectation E of both handsides of (11), 


£[5^+1,* + V{zk+i,z*)] < £[5^,* + V{zk,z*)] - E{V{zk+i,Zk)) 

n 


and by summing these inequalities, we obtain 0 < Sq,* + V(zo,z*) — ^Y^^=oE{V{zi+i, Zi)). This shows 
that £(X)fco -^i)) integrand is non-negative by Lemma It is therefore finite almost 

everywhere. In particular, the sequence V{zk+i,Zk) converges almost surely to zero. By Lemmag 
converges to zero almost surely. Say Zk+i{u!) — Zk(to) -G 0 for every oj G B where R is a probability event of 
probability one. 

We introduce the mapping T:Xxy^Xxy such that for any {x,y) G X xy, the quantity T{x,y) 
coincides with the couple {x, y) given by 


y = WOx^,h* {y + D{(t)Mx) 

X = prox,. g(a; - D{T)Vf{x) - D{t)M*{ 2y - y)) . 

With this definition, Sk+i = T{zk). By non-expansiveness of the proximity operator, it is straightforward to 
show that T is continuous. It is also straightforward to verify that its set of fixed points coincides with S. 

From now on to the end of this paragraph, we select a fixed uj G An B. Note that z/c(w) is a bounded 
sequence. Let z be a cluster point of the latter. We have shown that T(z/c(w)) — Zfe(w) —0 which implies 
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that T{z) — 5 = 0 by continuity of T. Thus, z G S. This implies that limfc_>.oo V^^^{zk{u}) — z) exists. Since 
V^^^{zk{uj) — z) tends to zero at least on some subsequence, we conclude that limfc_,.oo V'^/‘^{zk{oj) — z) = 0 . 
Otherwise stated, the sequence Zk(to) converges to some point z G S. This completes the proof of Theorem[^ 
in the case mi = • • • = rrip = 1 . □ 


3.3 General case 

For every j G {1,... ,p}, the space yj = is equipped with the inner product (it, v) = (“(0> 

We introduce the averaging operator Sj : yj —>■ yj defined for every u G yj by 

Sj{u) = — 
m,- 

^ i&IU) 


For any u G y^, we denote by O m = {u,... ,u) the vector of yj whose components all coincide with u. 
We introduce the linear operator Kj : X ^ yj by 

K,{x) = : * G /(j)) 

The operators S : y ^ y, K : X y are respectively defined by S{y) = , Sp{y^P'>)) and 

K{x) = {Ki{x),... ,Kp{x)). It is immediate to verify that 

M = D{m)SK (14) 


where m = (mi,... ,mp). In order to have some insights, the following example illustrates the construction 
of K for a given M. 

Example 1. Let T = and define M : X ^ y as the 3x3 matrix 


M = 


Mi,i 

A7i,2 

0 

A72,2 

,-^ 3,1 

-^3,2 



Here, 7(1) = {1,2} is the set of non-zero coefficients of the first row of M and it cardinal is mi = 2. Similarly 
m 2 = 1, m 3 = 3 and ^ = R®. Then : R^ —)■ R® coincides with the matrix 


M.l 

0 

0 \ 

0 

Mi^2 

0 

0 

M2,2 

0 

-^3,1 

0 

0 

0 

M^^2 

0 

V 0 

0 



and each row of K contains exactly one non-zero coefficient. On the other hand, S and D{m) respectively 
coincide with 



1 

2 

0 

0 

0 

o\ 


/2 

0 

0\ 

0 

0 

1 

0 

0 

0 

and 

D{m) = 0 

1 

0 

VO 

0 

0 

1 

3 

1 

3 



VO 

0 

y 


and obviously D{m)SK = M. 


We define the function h = ho (D{m)S). By (14), Problem (11 


is equivalent to 


min/(a:) -|- g{x) + h{Kx). 

x^X 


(15) 
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We denote by S the set of primal-dual solutions of the above problem i.e., the set of pairs (x», y*) S df x ^ 
satisfying 


0 G V/(a;*) -h dg{x^) + K*y^ 
0 G —Kx^ -I- dh*{y^). 


Substituting M with K, we may now apply Algorithm to (15). 
dj = mjaj and we define a G as the vector a = {ailmi, 

rrij whose components are all equal to one. Algorithm writes 
Initialization: Choose xq G A, G y. 

Iteration k: Define: 


For a fixed a = (cri,..., cXp), we define 
.., tfplmp) where Im^ is a vector of size 


yfc+i = prox. (y/^ + D(d-)Kxk) 

Xk+i = prox^_g(xfc - D(r)(V/(xfe) + K*{2y^^^ - y^))) . 
For i = i/j+i and for each j G J{ik+i), update: 


(i) 

"fc4 

. 0 ') 


ykiiii) = ykliii)- 


(16) 

(17) 


(18) 

(19) 


Otherwise, set x^^l^ = x^^\ yi+i(i) = yk\i)- 

Using the result of the Section 3.2 and the properties of K, the sequence (xfc, y^.) converges almost surely 
to a primal-dual point of Problem (15), provided that such a point exists and that the following condition 
holds: 

1 


n < 




'(s 


iG J(i) 


which is equivalent to Assumption [T||^. It remains to prove that the algorithm given by the iterations (16)- 
(19) coincides with Algorithm To that end, we need the following Lemma. 


Lemma 5. For any y €y, prox-(y) = {1^^ O prox^^]^* (-S'(y)), ■■■Ar. 
Proof. We have h{y) = h{miSi{y^^'>),... ,mpSp{y^P^)). Thus, 


‘Pi'Ox^l*('S'(y))) • 


h*{(f) = sup(¥>,y) - /i(TOiS'i(y(^)),.. .,mp5'p(y(P))) 
yey 


For all j G {1,.. ■ ,p}, denote by Cj the subset of yj formed by the vectors of the form {u,...,u) for some 
u G yj, and define C = Ci x • • • x Cp. Clearly, h {(f) = -|-oo whenever <-p and dh ((p) = 0 in that case. If 
on the other hand G C, one can write ip under the form p = (Imi 0 ■ ■ ■, Imp C) for some p € y. 

In that case. 


h*{p) = sup^(lm,- O P^^\ Imj 0 - h{miy^^\ ... ,mpy^^'>) 

y^yj=i 

P 

= sup N , rrijy^^'^) — h(miy^^\ ..., rripy^y^) =h*{p). 

y^yj=i 

Then, u G dh* (p) if and only if for every tp G y, h*{ip) > h*{p) + — p^^'^)) 

or equivalently, h*{ip) > h*{p) + ~ . Therefore, u G dh*(p) if and only if 

D{m)S{u) G dh*{p). 
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( 20 ) 


Now consider an arbitrary y G y and set q = prox- (y). This is equivalent to 


D{d- ^)(y -q) G dh*{q). 


In particular, q G dom{dh*) and thus q has the form q = (1^^ 0 ... ,\rnp ® for some q G y. 

The inclusion ( po| reads D{m)SD{d~^){y — q)) G dh*{q). Since D{m)SD{d~^) = D{a~^)S, we obtain 
D((7~^)(S(y) — q) G dh*{q) which is equivalent to g = prox^ (S'(y)). This completes the proof. □ 


The proof of the following Lemma is immediate. 

Lemma 6. For any y Gy, K*{y) = (EiGJ(i) ■ ■ ■ ,Y.jej{n) M*r,{y^^Hn))) . In particular, for 

any y Gy, 


We are now in a position to simplify the iterations (161-( 19). For every k, = (Imi ® yi+ij • ■ • > Imp ® 

y^k+i)- where we define = prox^_^* (5(2/)^+ -D(CT)iFxfe)). Upon noting that S'i!(CT)iF = D{a)D{m)SK = 
D{a)M, we obtain 

Vk+i = prox„,h» {zk + D{a)Mxk) (21) 

where we defined Zk = S{yk), otherwise stated, for each j G {1,... ,p}, 


U) _ 

k 


1 

rrij 


yk\i)- 

ieiU) 


Note that Zk+i differs from Zk only along the components j for which y^hii) differs from yk\i) for some i. 
That is, z^f^^ = z^^'^ for each j ^ J{ik+i) while for any j G J{ik+i), 

4+1 = 4 "^ + - yk\^k+i)) ■ ( 22 ) 

lllj 

Now consider equation 0- By Lemma § K*yk+i = M*yk+i. Thus, setting Wk = K*yk, equation 0 
simplifies to: 

Xk+i = proXpg(xk - D{T){Vf{xk) + (2M*2/fc+i - '“^fc))) • (23) 

By Lemma[^ again, Wk = (E^+JCi) Hl)> ■ • ■ > T,jeJ{n) ^(”'))- Therefore, Wk+i only differs from 

Wk along the i^+i-th coordinate and the update reads: 

Y. ^+..+i(yi+i(^fc+i)-yfe4*fe+i))- (24) 


Putting all pieces together, the update equations (211-(24) coincide with Algorithm [T| W e have thus proved 
that Algorithmis such that {xk,yk) converges to a primal-dual point of Problem (15) provided that such 
a point exists. To complete the proof, the final step is to relate the primal-dual solutions of Problem ( |I^ to 
the primal-dual solutions of the initial Problem 0- 


Consider the mapping G:Xxy^Xxy defined by G{x, y) = {x, (l^i C) y 


( 1 ) 


0 2 /(^))). 


Lemma 7. S = G{S). 

Proof. Let {x,y) G X xy and set y = (l^i 0 y^^\ ■ ■ ■, Imp ® 2/(^))- Then M*y = K*y, therefore 


0 G V/(a;) -I- dg{x) + K*y ^ 0 G V/(a;) -I- dg{x) + M*y . 
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Moreover, 


Q €—Kx-\-dh*{y) Kx G dh*(y) 

D(m)S{Kx) G dh*{y) 

Mx G dh*{y) 

where we used Lemma along with the identities D{m)SK = M and S{y) = y. The proof is completed 
upon noting that if {x, y) G S, then y has the form y = (1^1 ® y^^\ ■ • •, ® for some y Gy. □ 

We have shown that, almost surely, {xk,yk) converges to some point in G[S). This completes the proof 
of Theorem [T] 

4 Numerical experiments 

For all the experiments, we used one processor of a computer with Intel Xeon CPUs at 2.80GHz. 

4.1 Total Variation + regularized least squares Regression 

For given regularization parameters a > 0 and r G [0,1], we would like to solve the following regression 
problem with regularization given by the sum of Total Variation (TV) and the £i norm: 

min I \\Ax - b\\l + a{r ||x||^ + (1 - r) ||Mx ||2 ^ )■ 

The problem takes place on a 3D image of the brains of size 40 x 48 x 34. The optimization variable a; is a 
real vector with one entry in each voxel, that is n = 65,280. Matrix M is the discretized 3D gradient. This 
is a sparse matrix of size 195,840 x 65,280 with 2 nonzero elements in each row. The matrix A G ^'’68x65,280 
and the vector h G correspond to 768 labeled experiments where each line of A gathers brains activity 
for the corresponding experiment. Parameter r tunes the tradeoff between the two regularization terms. If r 
= 1, one gets a Lasso problem for which coordinate descent has been reported to be very efficient |FHHT07] . 
For r < 1, classical (primal) coordinate descent cannot be applied but primal-dual coordinate descent can. 

In this scenario, we set f{x) = ^ \\Ax — bjj^, g{x) = ar ||a:||^, h{y) = a(l—r) ||j /||2 i- We coded Algorithmj^ 
in CythoiPjand duplicated each dual variable two times. Note that as h = a(l — r) 11-112 ^ is not separable, 

^ - (z) ' 

we need to compute 12 dual components of yk+i for each primal variable updated and then use only 6 

of them to update for j G J{ik+i). We chose cxj such that p{J2j^j{i) is of the same order 

of magnitude as (3i and equal to 0.95 times its upper bound in AssumptionWe compared Algorithm 
against: 

• Vu-Condat’s algorithm [Vul3[ [Conl3] . 

• Chambolle-Pock’s algorithm [CPU) . 

• FISTA |BT09) with an inexact resolution of the proximal operator of TV, 

• L-BFGS |ZBLN97] with a smoothing of the nonsmooth functions and continuation. 

Figure indicates that our primal coordinate descent is a competitive algorithm for a wide range of regu¬ 
larization parameters. 

Note that Chambolle-Pock needs to compute the singular values decomposition of A (which explains the 
flat shape of the performance curve when the algorithm starts). FISTA and Vu-Condat need to estimate its 
largest singular value. If only a low accuracy is required. Algorithm may have reached this low accuracy 
even before these preprocessing steps are completed. 

^The code is available on http://perso.telecoin-parist6ch.fr/-ofercoq/Software.html 
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Figure 1: Comparison of algorithms for TV+Li-regularized regression at various regularization parameters 


L-BFGS has similar behaviour as Algorithm except for a = 0.1, r = 0.9 where it suffers from the 
non-smoothness of the objective while Algorithm deals with it directly by the proximal operators. FISTA 
is the fastest algorithm for strongly regularized problems. 

4.2 Linear Support Vector Machines 

We now present a second application for our algorithm. We consider a set of n observations gathered into 
a data matrix A £ and labels b £ M" and we intend to solve the following Support Vector Machine 

problem: 

” A 

1 “ bi{{A^w)i + Wo)) + - ||w||^ 

i—1 

As is common practice for this problem, we solve instead the Dual Support Vector Machine problem: 

1 ” 

max-— \\AD{b)x\\l + e^x - Ib^{x) 

^ i^l 

In the experiment^ we consider: 

• the RCVI dataset [LYRL04] where A is a sparse mxn matrix with m = 20,242, n = 47,236 and 0.157 
% of nonzero entries and we take Ci = ^ for all i and A = 

• the KDD cup 2009 dataset |GLB~*~0^ : the data is a mix of 14740 numerical values and 260 categorical 
values from Orange Labs. We preprocessed the data by adding a feature for each column containing 
missing values and binarizing the categorical values. We obtained a sparse matrix with m = 86,825, 
n = 50,000 and 1.79 % of nonzero entries. We divided the columns by their standard deviation 
and removed columns with a too small standard deviation. There are three tasks with this dataset: 
estimate the appetency, churn and up-selling probability of customers. As the classes are unbalanced, 

®Code available on https;//github. com/of ercoq/lightning 
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Figure 2: Comparison of dual algorithms for the resolution of linear SVM on the RCVl dataset. We report 
the value of the duality gap after a post-processing to recover feasible primal and dual variables. Primal 
variables are recovered as suggested in |SSZ13j and the intercept is recovered by exact minimization of the 
primal objective given the other primal variables. When dual iterates are not feasible, we project them 
onto the dual feasible set before computing the dual objective. We stopped each algorithm after 100 passes 
through the data: note that the cost per iteration of the three algorithms is similar. 


we compensate this with values of Ci proportional to the class weight and we chose max^ Ci = X = ^. 
We also chose a value of Ci depending on the class. 

Here f{x) = \ g{x) = ^lo,Ci]{xi)j Hu) = h-^iu) and M = I. We compare our 

method with 

• SCDA |SSZ13| : note that SDCA simply forgets Ib^{x) in order to be able to apply the classical 
coordinate descent method a thus will not converge to an optimal solution. 

• RCD |NP12| : at each iteration, the algorithm selects two coordinates randomly and performs a coor¬ 
dinate descent step according to these two variables. Updating two variables at a times allows us to 
satisfy the linear constraint at each iteration. 

We can see on Figures[^and[^the decrease of the SVM duality gap for each algorithm. SDCA is very efficient 
in the beginning and converges quickly. However, as the method does not take into account the intercept, 
it does not converge to the optimal solution and stagnates after a few passes on the data. Algorithm 
allows step sizes nearly as long as SDCA’s and taking into account the coupling constraint represents only 
marginal additional work. Hence, the objective value decreases nearly as fast for SDCA in the beginning 
without sacrificing the intercept, leading to a smaller objective value in the end. The RCD method of |NP12] 
does work but is not competitive in terms of speed of convergence. We also tried the C implementation of 
LIBSVM |CLllj but it needed 175s to solve the (medium-size) RCVl problem. 
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Figure 3: Comparison of dual algorithms for the resolution of linear SVM on the KDD cup 2009 dataset for 
the appetency, churn and up-selling tasks (one plot for each). We did the same post-processing as in Fig. 
We stopped each algorithm after 300 passes through the data. We can see here also that dealing with the 
intercept allows us to find more accurate solutions for a similar computational cost as with SDCA. 
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